Team 4: Sam Cole, Alicia Hauglie, & Sai Gugulothu
Our data came from MLB.com, which has MLB player stats dating all the way back to 1876. We chose to focus on active players, meaning those who are currently playing in the league. This dataset has hitting stats for all active players who have at least 3.1 plate appearances per team game played. These are the players who start almost every game for their team and get the most at bats in the league; in other words, the starting/regular players of each of the 30 MLB teams. In this data you can see player positions, batting averages, RBIs, and home runs, as well as many other stats for each player. The top 1000 rankings are included in our dataset. Ranking is determined solely by AVG, or batting average. Each row, or rank, holds the stats for one player for one year, so players who have played multiple years have multiple rows. It’s interesting to note that there are 1001 rows: two players share the same ranking at some point, so 1001 player-seasons fill 1000 ranks. MLB consists of both the American League (AL) and the National League (NL). The only major differences between the two are which teams are in each and that the American League has a “designated hitter” position that the NL does not. It is necessary to include some definitions in order to better understand our data and the game of baseball in general.
RK: rank. The rank of the player (in a given year) based on their batting average.
Player: Name of the player.
Year: Year in which the player’s stats are from.
Team: The abbreviation of the 30 MLB team names. In the Stadiums file, “Team Name” shows the full name of the teams.
Pos: Player’s position.
- 1B: first base
- SS: shortstop
- 2B: second base
- 3B: third base
- CF: center field
- RF: right field
- LF: left field
- C: catcher
- DH: designated hitter (only in the AL)
- OF: outfielder
G: number of games in which the player appeared (in a given year).
AB: number of official at bats by a batter. (This is plate appearances minus sacrifices, walks, and “hit by pitches”.)
R: runs. The number of times a baserunner safely reaches home plate.
H: hits. The number of times a batter hits the ball and reaches a base safely (without the aid of an error.)
2B: number of times a batter hits the ball and reaches second base.
3B: number of times a batter hits the ball and reaches third base.
HR: number of times a batter hits the ball and gets a home run.
RBI: runs batted in. The number of runs that come from a batter hitting the ball. (If bases are loaded and batter hits a HR, RBI is 4)
BB: walks. Four balls in an at bat.
SO: strikeouts. Three strikes during an at bat.
SB: stolen base. Number of times a player has stolen a base.
CS: caught stealing. Number of times a player has gotten out while trying to steal a base.
AVG: batting average. Hits divided by at bats (H/AB); roughly, the chance a player has of getting a hit during an at bat.
OBP: on-base percentage. How frequently a player gets on base per plate appearance.
SLG: slugging percentage. Total bases divided by at bats. It is like batting average, but singles, doubles, triples, and HRs count as one, two, three, and four bases. A higher SLG means a player is more “productive” when hitting.
OPS: on-base plus slugging (OBP + SLG). This measures the ability of a player to get on base AND hit for power.
SF: sacrifice flies. Number of times a runner tags up and scores after a batter’s fly out.
AO: fly outs. Total number of times a batter hit the ball and it was caught in the air, resulting in an out.
GO: ground outs. Number of times a batter has gotten out on a ground ball.
PA: plate appearances.
NP: number of pitches thrown during all of the batter’s plate appearances.
RBIAB: runs batted in per at bat.
HRAB: home runs per at bat.
BABIP: batting average on balls in play. When a player makes contact with the ball, what’s the chance they’ll get a hit? This does not account for strikeouts (because the ball is not put into play).
NPPA: number of pitches per plate appearance.
NPAB: number of pitches per at bat.
SOAB: number of strikeouts per at bat.
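As a worked illustration of the rate stats defined above, here is a minimal Python sketch (the project itself was done in R, so this is not the team's code). The function names and the sample season line are made up for illustration. The OBP value is simply assumed rather than computed, since the full OBP formula also needs hit-by-pitch counts, which are not among the columns defined above.

```python
def batting_average(hits, at_bats):
    """AVG: hits per official at bat."""
    return hits / at_bats


def slugging(hits, doubles, triples, home_runs, at_bats):
    """SLG: total bases per at bat."""
    # Singles are whatever hits remain after removing extra-base hits.
    singles = hits - doubles - triples - home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def ops(obp, slg):
    """OPS: on-base percentage plus slugging percentage."""
    return obp + slg


# Hypothetical season line (not a real player).
avg = batting_average(hits=150, at_bats=500)
slg = slugging(hits=150, doubles=30, triples=5, home_runs=25, at_bats=500)
assumed_obp = 0.380  # assumed value, for illustration only

print(f"AVG {avg:.3f}")                    # AVG 0.300
print(f"SLG {slg:.3f}")                    # SLG 0.530
print(f"OPS {ops(assumed_obp, slg):.3f}")  # OPS 0.910
```

Here the 150 hits break down into 90 singles, 30 doubles, 5 triples, and 25 home runs, which is 265 total bases over 500 at bats.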
What do the batting average and OPS look like for all active players?
Which position has the best batting average?
What stats have strong correlations to one another?
Does getting more pitches in an at bat increase the odds of hitting a home run?
What teams have the best batting averages?
How did switching teams affect Albert Pujols’s stats?
Why is Mike Trout considered such a well-rounded player (possibly the greatest of all time)?
Do home run hitters have higher strikeout percentages?
What makes the Yankees and Red Sox such a good rivalry?
Does the average number of pitches in an at-bat correlate with batting average? Does it correlate with HRs?
Where does Buster Posey fall in terms of average BABIP?
Which teams hit the most home runs?
Where are the stadiums of the 30 MLB teams?
Originally, we spent a large amount of time trying to scrape the data from the website using SelectorGadget. After running into issues with expanding tables on the page, we decided to simply create an Excel spreadsheet of the data, which we then imported as a dataframe named “dat”. To clean the data, we removed the * before player names, deleted the columns Player2 and Player3 (which just repeated the same info as Player), and converted a few variables from character to numeric. In order to answer some of our questions, we created a few variables by mutating existing columns. The variables created were: RBIAB, HRAB, BABIP, NP, NPPA, NPAB, and SOAB.
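The cleaning and mutating steps described above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code we actually ran. The sample row, the `NUMERIC_COLS` list, and the helper names are hypothetical, and the per-at-bat and per-plate-appearance ratios simply follow the definitions given earlier (BABIP is left out here because its formula appears later in the report).

```python
# Hypothetical raw row, as it might look straight out of the spreadsheet.
raw_rows = [
    {"Player": "*Example Hitter", "Player2": "Example Hitter",
     "Player3": "Example Hitter", "AB": "500", "HR": "25", "RBI": "90",
     "SO": "100", "PA": "560", "NP": "2240"},
]

NUMERIC_COLS = ["AB", "HR", "RBI", "SO", "PA", "NP"]  # assumed column list


def clean_row(row):
    """Drop duplicate name columns, strip the leading '*', convert numbers."""
    cleaned = {k: v for k, v in row.items() if k not in ("Player2", "Player3")}
    cleaned["Player"] = cleaned["Player"].lstrip("*").strip()
    for col in NUMERIC_COLS:
        cleaned[col] = float(cleaned[col])
    return cleaned


def add_rates(row):
    """Add the derived rate columns based on their definitions."""
    out = dict(row)
    out["RBIAB"] = row["RBI"] / row["AB"]  # runs batted in per at bat
    out["HRAB"] = row["HR"] / row["AB"]    # home runs per at bat
    out["SOAB"] = row["SO"] / row["AB"]    # strikeouts per at bat
    out["NPPA"] = row["NP"] / row["PA"]    # pitches per plate appearance
    out["NPAB"] = row["NP"] / row["AB"]    # pitches per at bat
    return out


dat = [add_rates(clean_row(row)) for row in raw_rows]
print(dat[0]["Player"], dat[0]["HRAB"], dat[0]["NPPA"])
```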
We began by getting some simple visualizations of the basic stats in our data:
What do the batting average and OPS look like for all active players?
This shows us the batting averages for all active players (with each data point being a certain player in a certain year). If an AVG is .300 (often read as “300”), that means the player gets a hit in 30% of at bats. For reference, .350 is a very good batting average. The outliers are players who played just a few games in a season and happened to play really well during those games (thus making them very high outliers). This alone does not tell us how good a player is, as different players have different goals: some hit to get on base, some hit for home runs, and so on. The distribution would be bell-shaped if worse-ranked players (those with AVGs below .250) were also included. Since our dataset only includes players with at least 3.1 plate appearances per team game, it would not include players with AVGs below .250, because they would get pulled for not hitting well and therefore would not reach 3.1 plate appearances per game.
Both batting average and OPS are bell-shaped for active players.
Which position has the best batting average?
| Pos | MEANAVG |
|---|---|
| 1B | 0.293 |
| SS | 0.284 |
| 2B | 0.289 |
| 3B | 0.285 |
| CF | 0.286 |
| RF | 0.284 |
| LF | 0.290 |
| C | 0.288 |
| DH | 0.275 |
| OF | 0.288 |
We found that designated hitters actually have the worst batting averages. This could be because we have less data for DHs, since the position exists only in the American League. If we had data for ALL players in the MLB, not just those with 3.1+ plate appearances per team game, we think we would see a different story: averages for all the positions would be much closer to one another, and DH would be among the positions with the higher batting averages. We think that our data may show DHs as the worst hitters because most of them are older players who are put in this position because they’re not very good at fielding anymore. Generally, the best players both bat and play the field.
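A per-position average like the one in the table can be sketched as a simple group-and-average step. The Python below is only an illustration with made-up rows (it is not our R code, and `mean_by_group` is a hypothetical helper). It also reports how many rows each position has, since a small group such as DH gives a less stable average.

```python
from collections import defaultdict


def mean_by_group(rows, group_col, value_col):
    """Return {group: (mean of value_col, number of rows)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {g: (sum(vals) / len(vals), len(vals)) for g, vals in groups.items()}


# Made-up player-seasons, for illustration only.
rows = [
    {"Pos": "1B", "AVG": 0.310},
    {"Pos": "1B", "AVG": 0.290},
    {"Pos": "1B", "AVG": 0.300},
    {"Pos": "DH", "AVG": 0.280},
    {"Pos": "DH", "AVG": 0.270},
]

summary = mean_by_group(rows, "Pos", "AVG")
# Sort positions from highest to lowest mean batting average.
ranked = sorted(summary.items(), key=lambda item: item[1][0], reverse=True)
for pos, (mean_avg, n) in ranked:
    print(f"{pos}: MEANAVG={mean_avg:.3f} (n={n})")
```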
What stats have strong correlations to one another?
To find what stats have strong correlations to one another we created a correlation matrix using several of the variables in our data and produced a heatmap to better visualize the relationships.
We started by graphing a bunch of scatterplots, going through variables two at a time to see their relationships, until we realized we could just use a correlation matrix and see all the stats at the same time. As you can see, there is a lot of red and not a lot of green/grey. This is because most hitting stats in baseball are positive outcomes for the hitter and therefore rise together. The only 2 stats that are inverses are AO and SO (fly outs and strikeouts), and this is shown by these being the only grey points on the matrix. The strongest inverse relationship is between fly outs and BABIP. This makes sense because a fly out is a ball put in play that does not become a hit, so every fly out lowers BABIP. You can also see a lot of very strong correlations, because many of the stats measure very similar things, for example OBP and AVG.
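To make the idea of a correlation matrix concrete, here is a small Python sketch that computes Pearson correlations between every pair of columns. It is illustrative only: the numbers are made up, the helper names are hypothetical, and it is not the R code behind our heatmap.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for four player-seasons.
columns = {
    "AVG": [0.300, 0.280, 0.260, 0.320],
    "OBP": [0.380, 0.350, 0.330, 0.400],
    "AO": [120, 140, 150, 110],
}

matrix = correlation_matrix(columns)
for a in columns:
    print(a, [round(matrix[a][b], 2) for b in columns])
```

In this toy data AVG and OBP move together (a strong positive value), while AO moves against both (a negative value), which mirrors the kind of pattern a heatmap makes visible at a glance.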
Does getting more pitches in an at bat increase the odds of hitting a home run?
After using a scatterplot to visualize the relationship between HRPA and NPPA and finding the R^2, we concluded that there is no real evidence of a correlation between the two.
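For readers unfamiliar with R^2, the sketch below shows one way to compute it for a straight-line fit of one stat against another. It is a minimal Python illustration with made-up numbers, not the R code or data we used, and `r_squared` is a hypothetical helper name.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Compare the leftover (residual) variation to the total variation in ys.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up pitches-per-plate-appearance and home-run-rate values.
pitches = [3.6, 3.8, 4.0, 4.2]
unrelated_hr_rate = [0.03, 0.05, 0.05, 0.03]  # no linear trend
linear_hr_rate = [0.03, 0.04, 0.05, 0.06]     # perfectly linear trend

print(round(r_squared(pitches, unrelated_hr_rate), 3))
print(round(r_squared(pitches, linear_hr_rate), 3))
```

An R^2 near 0 means the line explains almost none of the variation, and a value near 1 means it explains almost all of it. For a single predictor, R^2 is the square of the correlation coefficient.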
What teams have the best batting averages?
| Team | MEANAVG |
|---|---|
| FLA | 0.303 |
| STL | 0.297 |
| WSH | 0.296 |
| COL | 0.295 |
| DET | 0.294 |
We found that the teams with the top 5 best batting averages in our dataset were (in order): the Miami Marlins (listed as FLA, from their years as the Florida Marlins), the St. Louis Cardinals, the Washington Nationals, the Colorado Rockies, and the Detroit Tigers. Upon looking at the players for these teams, we saw that Albert Pujols holds 2 of the top 10 rankings in our data (so his batting average was very high in those years) and was playing for the Cardinals at the time, which is a large contributing factor as to why they have one of the best team batting averages. This leads us into our next question…
How did switching teams affect Albert Pujols’s stats?
Why is Mike Trout considered such a well-rounded player (possibly the greatest of all time)?
Mike Trout is currently the best baseball player in the entire world, and some people think that he will be considered the best player who has ever played the game. Just like with Albert Pujols, pitchers are scared to pitch to him, and you can see he accumulates a lot of walks each season. We chose to look at stolen bases as well with Trout, as he is known to be fast, unlike Pujols, who is quite slow. The vertical black line in this case represents 2017. That year, a thumb injury caused him to miss about a third of the season: Trout played only 114 games out of 162, missing 48. In both 2016 and 2018, Trout played the majority of the season, missing no more than 20 games each year. You can see with the bottom three lines (HR, SB, and XBH) that he came back after being injured in 2017 and had an incredible season, almost matching his prior stats and even hitting more home runs than the year before. We would like to think of this as a testament to what kind of player Mike Trout is; the stats are proof of his greatness.
Do home run hitters have higher strikeout percentages?
When we looked at the relationship between home runs and strikeouts per at bat, the R^2 was only .162 (a correlation of roughly 0.40 in magnitude), which indicates that our data show at most a weak relationship between hitting home runs and having a higher strikeout percentage.
What makes the Yankees and Red Sox such a good rivalry?
When trying to answer our question about why the Red Sox and Yankees rivalry is so intense, we wanted to see if there was a difference in playing styles. We decided to compare batting average and the average number of home runs per player on each team because of Sam’s lifelong love of watching and playing baseball. The Yankees are always thought of as big home run hitters, and the Red Sox are thought to have very consistent, solid hitters. We wanted to see if this was the case, and when we graphed it, we could see right away that our assumptions held. The Red Sox have a much higher batting average, and the Yankees hit almost 5 more home runs per player each season. To answer the question: these games are so fun to watch because of the different hitting styles each team uses, one with an emphasis on batting average, the other with an emphasis on home runs. Because the teams have different strengths, the match-up is very close, which makes for an exciting rivalry with nail-biting games.
Does the average number of pitches in an at-bat correlate with batting average? Does it correlate with HRs?
For this question we looked at batting average and number of pitches per at bat, which showed no correlation. When we compared home runs and number of pitches per at bat, we got a very weak positive correlation, but not enough to draw any conclusions about the two. We chose to include questions in which we found no correlation between variables because these are things we were curious about when looking at our data, and it’s important to note that sometimes your findings aren’t as fun or noteworthy as you would like. We wanted to include the full story, not just the shocking findings.
Where does Buster Posey fall in terms of average BABIP?
To answer this question, we created the BABIP variable (batting average on balls in play), using the formula: BABIP = (H - HR) / (AB - SO - HR + SF). From this, we wanted to see where Sam’s favorite player stood versus the rest of the players in our data, so we created this histogram. Highlighted in orange (the color of the Giants, the team he plays for) is Buster Posey’s average BABIP over his career, and you can see that he is above the average of all the players, which fits his reputation as one of the best-hitting catchers to ever play. This is still an accomplishment for Posey, as he is getting older but continues to hit well enough to remain above the average.
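The BABIP formula above translates directly into code. The Python sketch below is illustrative only: the season lines and player labels are made up (they are not Buster Posey's numbers), the helper names are hypothetical, and it is not the R code we ran. It computes BABIP for each season and then compares one player's average against the average over all rows.

```python
def babip(h, hr, ab, so, sf):
    """BABIP = (H - HR) / (AB - SO - HR + SF)."""
    return (h - hr) / (ab - so - hr + sf)


def mean(values):
    return sum(values) / len(values)


# Made-up player-seasons, for illustration only.
seasons = [
    {"Player": "Player A", "H": 150, "HR": 25, "AB": 500, "SO": 100, "SF": 5},
    {"Player": "Player A", "H": 160, "HR": 20, "AB": 520, "SO": 90, "SF": 4},
    {"Player": "Player B", "H": 130, "HR": 30, "AB": 510, "SO": 140, "SF": 3},
]

for s in seasons:
    s["BABIP"] = babip(s["H"], s["HR"], s["AB"], s["SO"], s["SF"])

overall_mean = mean([s["BABIP"] for s in seasons])
player_mean = mean([s["BABIP"] for s in seasons if s["Player"] == "Player A"])

print(f"Overall mean BABIP:  {overall_mean:.3f}")
print(f"Player A mean BABIP: {player_mean:.3f}")
```

In the first made-up season, 125 non-home-run hits over 380 balls in play gives a BABIP of about .329.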
Which teams hit the most home runs?
| Team | MeanHR |
|---|---|
| TOR | 0.049 |
| BAL | 0.045 |
| MIL | 0.045 |
| LAD | 0.044 |
| STL | 0.044 |
| FLA | 0.042 |
| CHC | 0.041 |
| COL | 0.040 |
| SEA | 0.040 |
| WSH | 0.039 |
| DET | 0.039 |
| ARI | 0.039 |
| NYY | 0.038 |
| CIN | 0.038 |
| TEX | 0.037 |
| HOU | 0.037 |
| TB | 0.036 |
| CLE | 0.035 |
| LAA | 0.035 |
| MIN | 0.034 |
| BOS | 0.034 |
| OAK | 0.034 |
| CWS | 0.033 |
| PHI | 0.033 |
| NYM | 0.032 |
| ATL | 0.032 |
| SD | 0.031 |
| MIA | 0.030 |
| SF | 0.029 |
| PIT | 0.029 |
| KC | 0.029 |
| ANA | 0.028 |
| MON | 0.010 |
As you can see in the table, the top 5 highest HR hitting teams are: the Toronto Blue Jays, the Baltimore Orioles, the Milwaukee Brewers, the Los Angeles Dodgers, and the St. Louis Cardinals. If you remember our question about team batting averages, the Cardinals were in the top 5 there as well, so they have both a high batting average and a high number of home runs hit.
Where are the stadiums of the 30 MLB teams?
We added this question after learning how to create choropleth maps; we wanted to see how cool we could make a map of the MLB stadium locations look. Each team’s stadium point is drawn in the team’s colors, for which we found the exact hex codes. We used an Excel file with each stadium’s latitude and longitude to plot the points on a map. Using ggrepel, we labeled the points with the shortened team names and included each team’s batting average as well. It took quite a bit of code to create this map, but it was enjoyable to learn more about choropleth maps.
In working with this MLB data we had some very interesting findings, got very real hands-on coding experience, and put into practice the data acquisition and exploration skills we had learned in class throughout the semester. It was fun to get to use data we chose ourselves and collaborate on making a meaningful project using the data. Finding out answers to our questions using code we wrote ourselves was satisfying, especially for questions that had surprising answers like “Which position has the best batting average?” or “Did Albert Pujols’s stats drop when he switched teams?” Creating visualizations that could tell our story in an easily understandable way was difficult but rewarding. It will be a useful skill to have in the future, and it’s something we will continue to improve upon as we keep working with more data.
Sam: I started with figuring out what data set we were going to use. Since I have the most baseball knowledge out of the group, I came up with all the questions that we were going to answer. I then did a large majority of the coding, making the graphs, and answering the questions. I set up the skeleton of the presentation, and explained what Alicia and I were going to say for each graph/slide in the PowerPoint. For the project report, I wrote about half of the graph explanations and conclusions, and Alicia put them into this Rmd.
Alicia: A majority of the work I did on the project was aesthetic, such as tweaking graphs or creating the PowerPoint. I helped Sam write the code when he got stuck on certain bits, and we mostly worked on the project when we were together so we could collaborate and have two heads solving problems rather than just one. I created our slides and helped make our visuals easily understandable with the help of Sam’s input. I made the choropleth map of the team locations, which took me quite a while and quite a long bit of code to get it where I wanted it to be. Sam and I wrote out the project report and decided which graphs and charts to include, and then I went through and made sure it was fluid.
Sai: